Exploring 15,399 interpretable features from 72 TopK SAEs
Interpretability Rate: 15,399 / 36,864 features
Monosemantic Rate: 11,187 / 15,399 features
Total Correlations: Avg 3.2 per feature
Dominant feature type: Number Features

| Feature ID | Layer | Position | Label | Mono | Top Enrichment | Top Token | Correlations | Details |
|---|---|---|---|---|---|---|---|---|
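
The summary counts above can be turned into percentages with a few lines of arithmetic. This is a minimal sketch, not the dashboard's own code: the constant and function names are hypothetical, and the per-SAE figure assumes all 72 SAEs have the same number of features, which the page does not state.

```python
# Counts as reported in the summary above.
TOTAL_FEATURES = 36_864   # all features across the SAEs
INTERPRETABLE = 15_399    # features judged interpretable
MONOSEMANTIC = 11_187     # interpretable features judged monosemantic
NUM_SAES = 72


def rate(part: int, whole: int) -> float:
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


interpretability_rate = rate(INTERPRETABLE, TOTAL_FEATURES)
monosemantic_rate = rate(MONOSEMANTIC, INTERPRETABLE)

# Assumption: features are split evenly across the 72 SAEs.
features_per_sae = TOTAL_FEATURES / NUM_SAES

print(f"Interpretability rate: {interpretability_rate:.1f}%")  # 41.8%
print(f"Monosemantic rate: {monosemantic_rate:.1f}%")          # 72.6%
print(f"Features per SAE (if equal): {features_per_sae:.0f}")  # 512
```

Note that the monosemantic rate is measured against the interpretable features, not against all 36,864 features.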
GPT-2 (124M): 72.6% monosemantic rate, 29.3% sparsity
LLaMA (1B): ~50% estimated monosemantic rate, 19.5% sparsity
These figures suggest that smaller models rely on more specialized, monosemantic features to compensate for their limited capacity.